The Used Car Price Prediction dataset contains 4,009 vehicle listings collected from the automotive marketplace cars.com. Each row represents a unique car and includes twelve attributes relevant to pricing and vehicle characteristics. The dataset is taken from Kaggle: https://www.kaggle.com/datasets/taeefnajib/used-car-price-prediction-dataset
The dataset provides information on:
Brand and model – manufacturer and specific vehicle model
Model year – age of the car, influencing depreciation
Mileage – an indicator of usage and wear
Fuel type – e.g., gasoline, diesel, electric, hybrid
Engine type – performance and efficiency characteristics
Transmission – automatic or manual
Exterior/interior colors – aesthetic properties
Accident history – whether the car has previously been damaged
Clean title – legal/ownership status
Price – listed price of the vehicle
Overall, the dataset offers a structured overview of key features that influence used car valuation. It is well-suited for analytical tasks such as understanding pricing drivers, exploring consumer preferences, and building predictive models for vehicle prices.
# Raw data
We load the original CSV directly from the project data folder using
here() so paths work regardless of the working
directory.
raw_path <- here("data", "raw", "used_cars.csv")
cars_raw <- readr::read_csv(raw_path, show_col_types = FALSE)
Basic structure and summary statistics of the raw dataset:
glimpse(cars_raw)
## Rows: 4,009
## Columns: 12
## $ brand <chr> "Ford", "Hyundai", "Lexus", "INFINITI", "Audi", "Acura", ~
## $ model <chr> "Utility Police Interceptor Base", "Palisade SEL", "RX 35~
## $ model_year <dbl> 2013, 2021, 2022, 2015, 2021, 2016, 2017, 2001, 2021, 202~
## $ milage <chr> "51,000 mi.", "34,742 mi.", "22,372 mi.", "88,900 mi.", "~
## $ fuel_type <chr> "E85 Flex Fuel", "Gasoline", "Gasoline", "Hybrid", "Gasol~
## $ engine <chr> "300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability", "~
## $ transmission <chr> "6-Speed A/T", "8-Speed Automatic", "Automatic", "7-Speed~
## $ ext_col <chr> "Black", "Moonlight Cloud", "Blue", "Black", "Glacier Whi~
## $ int_col <chr> "Black", "Gray", "Black", "Black", "Black", "Ebony.", "Bl~
## $ accident <chr> "At least 1 accident or damage reported", "At least 1 acc~
## $ clean_title <chr> "Yes", "Yes", NA, "Yes", NA, NA, "Yes", "Yes", "Yes", "Ye~
## $ price <chr> "$10,300", "$38,005", "$54,598", "$15,500", "$34,999", "$~
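As the glimpse shows, `milage` and `price` arrive as character strings with units and currency symbols, so they must be parsed before any numeric analysis. A minimal base-R sketch of that step (the `clean_number` helper is our own illustration, not necessarily the project's implementation):

```r
# Strip currency symbols, thousands separators, and unit suffixes,
# then coerce to numeric. Sample values copied from the glimpse() output.
raw <- data.frame(
  milage = c("51,000 mi.", "34,742 mi."),
  price  = c("$10,300", "$38,005"),
  stringsAsFactors = FALSE
)
clean_number <- function(x) as.numeric(gsub("[^0-9.]", "", x))
raw$milage_num <- clean_number(raw$milage)  # 51000 34742
raw$price_num  <- clean_number(raw$price)   # 10300 38005
```

The processed dataset used below stores the result of this kind of cleaning as numeric columns.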
We base the EDA on the engineered dataset
(data/processed/used_cars_features.csv) that keeps cleaned
numeric fields and derived features like age, mileage in thousands, and
accident flags.
features_path <- here("data", "processed", "used_cars_features.csv")
cars <- readr::read_delim(features_path, delim = ";", show_col_types = FALSE)
| variable | median | mean | p25 | p75 | sd | min | max |
|---|---|---|---|---|---|---|---|
| price_dollar | 28000.00 | 36865.68 | 15500.00 | 46999.00 | 36531.16 | 2000.0 | 649999.00 |
| log_price | 10.24 | 10.19 | 9.65 | 10.76 | 0.82 | 7.6 | 13.38 |
| age | 9.00 | 10.32 | 6.00 | 14.00 | 5.87 | 1.0 | 29.00 |
| milage_k | 63.00 | 72.14 | 30.00 | 103.00 | 53.60 | 0.0 | 405.00 |
| horsepower | 310.00 | 331.51 | 248.00 | 400.00 | 120.32 | 76.0 | 1020.00 |
| accident | n | share |
|---|---|---|
| At least 1 accident or damage reported | 871 | 0.28 |
| None reported | 2194 | 0.72 |
Median listing sits around $28k, with the middle 50% between roughly $15.5k and $47k, while the maximum reaches $650k—explaining the heavy right tail. Median age is 9 years (IQR: 6–14), typical mileage is about 63k miles (IQR: 30k–103k), and horsepower clusters around 310 HP (IQR: 248–400). About 28% of cars report an accident or damage, a meaningful factor for pricing.
Raw prices are extremely right-skewed, with most listings below $80k but a long tail of luxury and exotic vehicles. Modeling on this scale would be dominated by a few high-price outliers.
Log transformation produces a more bell-shaped distribution and stabilizes variance, making linear-style models and visual comparisons more reliable.
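As a quick numeric illustration, applying `log()` to the minimum, quartiles, and maximum from the summary table reproduces the log_price column:

```r
# Log compresses the heavy right tail: the $650k outlier is ~325x the
# minimum on the raw dollar scale, but less than 6 log units above it.
prices <- c(2000, 15500, 28000, 46999, 649999)
round(log(prices), 2)
# 7.60 9.65 10.24 10.76 13.38 -- matching the log_price summary row above
```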
Prices decline with age across fuels. Electric listings start high but show the sharpest early drop; diesel holds comparatively high prices across ages (though the diesel sample is small), and gasoline sits lower overall.
Among the 12 most common brands, Porsche leads on median price, followed by Land Rover and Mercedes-Benz; volume brands (Toyota, Nissan, Jeep) cluster lower with tighter spreads, while others (Chevrolet, Ford) span broader lineups.
Higher mileage correlates with lower prices. We use a loess smoother (not a straight trendline) and cap the x-axis at 250k miles to reduce the influence of extreme outliers; automatics show a steady decline, and the smaller manual subset is noisier but similar in direction.
Cars with reported accidents trade at a clear discount relative to clean histories, even after log-scaling prices, confirming accident history as an important predictor.
In Section 3 we saw that price is strongly right-skewed and that log(price) has an approximately linear relationship with age. Based on this, we now fit a linear regression model with log(price) as response.
The goal is not primarily to build the most accurate predictor, but to understand which factors drive used-car prices and in which direction.
We model the natural logarithm of the price instead of the raw price because the raw price distribution is extremely right-skewed; the log transform stabilizes variance and makes coefficients interpretable as approximate percentage effects.
We use the feature dataset created in the previous step
(data/processed/used_cars_features.csv) and fit the
following multiple linear regression model on the log-price:
log_price ~ age + milage_k + accident_bin + brand + fuel_type + transmission + ext_col + int_col
Here:
- log_price is the natural logarithm of the car price in dollars (the target variable / response).
- age is the car age in years.
- milage_k is the mileage in thousands of miles.
- accident_bin is a binary variable (1 = at least one reported accident, 0 = none).
- brand, fuel_type, transmission, ext_col and int_col are categorical predictors and are represented in the model by dummy variables with one reference category each.

The coefficients of this model measure how much the expected log-price changes when we increase a numeric predictor by one unit (or switch a dummy variable from 0 to 1), while keeping all other variables fixed.
We implement and fit the model by sourcing the script src/04_model_linear.R, which:
- loads the feature data (used_cars_features.csv),
- fits the model with stats::lm(...), and
- prepares the lm_linear_data object used below.

# Fit the baseline linear regression model and prepare lm_linear_data
source(here::here("src", "04_model_linear.R"))
# Show a compact summary of the model
broom::tidy(lm_linear) |>
dplyr::slice(1:10) # show only first 10 coefficients
## # A tibble: 10 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 11.4 0.129 87.8 0
## 2 age -0.0572 0.00213 -26.8 2.01e-138
## 3 milage_k -0.00597 0.000223 -26.8 1.79e-138
## 4 accident_bin -0.0872 0.0198 -4.40 1.12e- 5
## 5 brandAlfa 0.0348 0.174 0.200 8.41e- 1
## 6 brandAudi 0.252 0.0847 2.98 2.95e- 3
## 7 brandBMW 0.322 0.0804 4.01 6.38e- 5
## 8 brandBentley 1.31 0.126 10.4 9.21e- 25
## 9 brandBuick -0.0200 0.152 -0.132 8.95e- 1
## 10 brandCadillac 0.304 0.0912 3.33 8.72e- 4
| metric | value |
|---|---|
| RMSE (log-price) | 0.411 |
| R squared (log-price) | 0.733 |
Because the model is fitted on the log(price) scale, each coefficient can be read approximately as a percentage change in price when we increase that variable by one unit (or switch a dummy variable from 0 to 1), while keeping all other variables fixed. Roughly, a coefficient of −0.06 means “about −6 %”, a coefficient of +0.20 means “about +22 %”, and so on.
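The conversion is simply exp(beta) − 1, expressed in percent:

```r
# Approximate percentage effect of a coefficient on the log(price) scale.
pct_change <- function(beta) (exp(beta) - 1) * 100
round(pct_change(-0.06), 1)  # -5.8  -> "about -6%"
round(pct_change(0.20), 1)   #  22.1 -> "about +22%"
```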
Below we interpret a few selected coefficients from the model.
Age (coefficient ≈ −0.058)
Holding all other variables constant, increasing the age of a car by one
year reduces the expected price by about 6 %
(exp(−0.058) ≈ 0.94).
→ Older cars are substantially cheaper, as expected.
Mileage in thousands (coefficient ≈
−0.006)
An additional 1,000 miles reduces the expected price by roughly
0.6 %.
→ The effect of mileage is noticeable but smaller than the effect of
age.
Accident history (accident_bin, coefficient
≈ −0.080)
accident_bin = 1 indicates that at least one accident or
damage has been reported. Compared to a car with no accident history
(accident_bin = 0), the expected price is lower by about
8 %.
→ Cars with an accident history sell for clearly lower prices.
Brand examples
The brand coefficients show how each brand differs from the (omitted)
reference brand, holding age, mileage, accident history and all other
variables constant.
brandBMW (coefficient ≈ 0.32): price is about
38 % higher than for the reference brand (exp(0.32) ≈ 1.38).
→ BMW cars are noticeably more expensive, even after controlling for
other factors.
brandFerrari (coefficient ≈ 1.82): price is more
than six times higher (over +500 %) than for the
reference brand.
→ Ferrari appears as an extreme premium brand in this dataset.
Fuel type example (fuel_typeElectric,
coefficient ≈ −0.79)
For purely electric cars, the coefficient corresponds to a price that is
roughly 55 % lower than for the reference fuel type,
given the same age, mileage, brand etc.
→ In this sample, electric cars are priced clearly below comparable
vehicles with the reference fuel type (possibly due to different model
mix, range concerns, or incentives on combustion cars).
Transmission (transmissionManual,
coefficient ≈ 0.20)
Manual transmission has a coefficient of about 0.20, which translates to
prices that are roughly 20–22 % higher than for the
reference transmission type.
→ Cars with manual transmission are on average more expensive in this
sample.
Overall, the signs and magnitudes of the coefficients are plausible: older and higher-mileage cars and cars with accidents are cheaper, while premium brands and certain configurations (e.g. BMW, Ferrari) command much higher prices. With an \(R^2\) of about 0.73 on the log-price scale, the model explains a large share of the variation in used car prices, even though there is still substantial unexplained variability.
To visualise the effect of age in the linear model, we plot log-price against age and add a fitted linear trend line.
The plot shows the expected negative relationship: older cars tend to have a lower log-price. In our model, the age coefficient is about −0.058, which means that, holding all other variables constant, one additional year of age reduces the expected price by roughly 6 % (because exp(−0.058) ≈ 0.94).
To check whether the linear model assumptions are roughly satisfied, we plot the residuals against the fitted (predicted) log-prices. Ideally, the points are scattered randomly around zero without a clear pattern.
In our plot, the residuals are roughly centred around zero with no
strong curvature. There is some increase in spread for higher predicted
prices, but overall the linear model assumptions appear acceptable.
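The same diagnostic takes only a few lines; the sketch below uses a stand-in model on built-in data, since the identical call applies to the report's lm_linear:

```r
# Residuals-vs-fitted diagnostic, illustrated on mtcars.
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)  # points should scatter randomly around this line
```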
Overall, the linear regression on log(price) provides a simple and interpretable summary of the main price drivers in this dataset. Age and mileage have the expected negative impact on prices, while an accident history leads to a price discount of roughly 8 %. Premium brands such as BMW and especially Ferrari command large price premia, even after controlling for age, mileage and fuel type. The model explains about 73 % of the variance in log-prices, which is quite high for real-world data, but the residual plot also shows that there is still substantial unexplained variability. For a client, this model could already be used as a rough pricing guideline, but more advanced models (e.g. with interactions or nonlinear effects) might capture the remaining structure in the data even better.
A binomial Generalized Linear Model (GLM) is appropriate when the
response variable is binary.
Here, the response variable is accident_bin:
- 1 = car has had at least one reported accident
- 0 = no reported accident

The model estimates the probability that a used car has an accident history based on vehicle characteristics.
We model accident history using a logistic regression with a logit link:
\[ \log\left(\frac{P(\text{accident}=1)}{1 - P(\text{accident}=1)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p \]
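To make the link function concrete: the inverse logit (`plogis()` in R) maps any linear-predictor value back to a probability. The numbers below are purely illustrative, not taken from the fitted model:

```r
# plogis() is the inverse logit: it maps a linear predictor to a probability.
eta <- -0.5 + 0.07   # hypothetical intercept plus one predictor contribution
p <- plogis(eta)     # P(accident = 1) for this hypothetical configuration
round(p, 3)          # ~ 0.394
```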
Model formula:
accident_bin ~ brand + age + milage_k + fuel_type + transmission + price_dollar + horsepower + cylinders
The response is accident_bin (binary); the remaining variables act as predictors.
# Source the script that fits the binomial GLM
source(here::here("src", "05_model_glm_binomial.R"))
# Show a full summary of the model
summary(glm.car)
##
## Call:
## glm(formula = accident_bin ~ brand + age + milage_k + fuel_type +
## transmission + price_dollar + horsepower + cylinders, family = "binomial",
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.565e-01 4.880e-01 -1.550 0.1211
## brandAlfa -1.102e+00 1.113e+00 -0.990 0.3221
## brandAudi 2.338e-01 3.826e-01 0.611 0.5412
## brandBMW -1.410e-01 3.655e-01 -0.386 0.6997
## brandBentley 7.038e-02 7.604e-01 0.093 0.9263
## brandBuick 3.798e-01 6.019e-01 0.631 0.5280
## brandCadillac 4.799e-02 4.195e-01 0.114 0.9089
## brandChevrolet -2.149e-01 3.804e-01 -0.565 0.5721
## brandChrysler -6.488e-02 5.726e-01 -0.113 0.9098
## brandDodge 8.623e-02 4.292e-01 0.201 0.8408
## brandFerrari 1.980e+00 1.275e+00 1.553 0.1205
## brandFord -8.352e-02 3.679e-01 -0.227 0.8204
## brandGMC 1.849e-01 4.318e-01 0.428 0.6685
## brandGenesis -1.373e+00 1.095e+00 -1.254 0.2097
## brandHonda -2.186e-01 4.732e-01 -0.462 0.6441
## brandHummer 3.534e-01 6.157e-01 0.574 0.5660
## brandHyundai 1.307e-01 4.424e-01 0.295 0.7677
## brandINFINITI -2.661e-01 4.589e-01 -0.580 0.5621
## brandJaguar -2.984e-02 5.020e-01 -0.059 0.9526
## brandJeep -3.128e-01 4.153e-01 -0.753 0.4513
## brandKia -1.944e-01 4.982e-01 -0.390 0.6964
## brandLamborghini -1.127e+01 3.235e+02 -0.035 0.9722
## brandLand -7.250e-01 4.439e-01 -1.633 0.1025
## brandLexus 3.271e-01 3.909e-01 0.837 0.4027
## brandLincoln -1.301e-01 4.753e-01 -0.274 0.7843
## brandMINI -1.140e+00 6.409e-01 -1.778 0.0754 .
## brandMaserati -1.064e+00 7.121e-01 -1.494 0.1352
## brandMazda -4.210e-02 5.076e-01 -0.083 0.9339
## brandMercedes-Benz 8.343e-02 3.713e-01 0.225 0.8222
## brandMitsubishi 1.051e-01 5.975e-01 0.176 0.8604
## brandNissan 1.616e-01 4.075e-01 0.396 0.6918
## brandPontiac -3.582e-01 6.867e-01 -0.522 0.6019
## brandPorsche 2.548e-01 4.055e-01 0.629 0.5297
## brandRAM 3.967e-01 4.589e-01 0.864 0.3873
## brandRivian -1.300e+01 3.749e+02 -0.035 0.9723
## brandSubaru 2.139e-01 4.458e-01 0.480 0.6313
## brandTesla -1.094e+00 8.824e-01 -1.240 0.2150
## brandToyota -1.868e-01 3.866e-01 -0.483 0.6289
## brandVolkswagen -3.929e-01 4.792e-01 -0.820 0.4123
## brandVolvo -5.475e-01 5.440e-01 -1.006 0.3142
## age -1.164e-02 1.190e-02 -0.978 0.3281
## milage_k 6.995e-03 1.150e-03 6.083 1.18e-09 ***
## fuel_typeE85 Flex Fuel -1.023e-01 3.112e-01 -0.329 0.7424
## fuel_typeElectric -1.286e+00 6.523e-01 -1.972 0.0486 *
## fuel_typeGasoline -1.889e-01 2.541e-01 -0.743 0.4574
## fuel_typeHybrid -2.298e-01 3.338e-01 -0.689 0.4911
## fuel_typePlug-In Hybrid 4.227e-01 4.819e-01 0.877 0.3803
## transmissionManual -2.091e-01 1.606e-01 -1.302 0.1929
## price_dollar -2.169e-05 3.845e-06 -5.642 1.68e-08 ***
## horsepower 1.829e-03 8.796e-04 2.079 0.0376 *
## cylinders -3.470e-02 5.214e-02 -0.665 0.5058
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3658.7 on 3064 degrees of freedom
## Residual deviance: 3306.0 on 3014 degrees of freedom
## AIC: 3408
##
## Number of Fisher Scoring iterations: 14
We scale and exponentiate the coefficients of the binomial GLM to obtain odds ratios, which show how the odds of a car having an accident history change with each predictor.
To improve interpretability, continuous predictors are rescaled before computing odds ratios. This rescaling does not affect statistical significance or model fit, only the unit of interpretation.
The data were scaled as follows:
- milage_k was multiplied by 10, so odds ratios are per 10,000 miles.
- price_dollar was multiplied by 1,000, so odds ratios are per $1,000.
- horsepower was multiplied by 50, so odds ratios are per 50 HP.

| term | odds_ratio | pct_change | p.value | signif |
|---|---|---|---|---|
| milage_k | 1.072 | 7.245 | 0.000 | *** |
| fuel_typeElectric | 0.276 | -72.374 | 0.049 | * |
| price_dollar | 0.979 | -2.146 | 0.000 | *** |
| horsepower | 1.096 | 9.575 | 0.038 | * |
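These odds ratios follow directly from the raw coefficients in the GLM summary above, via exp(coefficient × scale):

```r
# Odds ratio on the rescaled unit = exp(raw coefficient * scale factor).
round(exp(6.995e-03 * 10), 3)    # milage_k per 10,000 miles -> 1.072
round(exp(-2.169e-05 * 1000), 3) # price_dollar per $1,000   -> 0.979
round(exp(1.829e-03 * 50), 3)    # horsepower per 50 HP      -> 1.096
```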
Interpretation of key predictors:
Mileage (milage_k): Odds ratio >
1 → Higher mileage slightly increases the odds of a reported accident.
Specifically, an additional 10,000 miles increases the odds by ~7%,
reflecting increased usage and exposure.
Price (price_dollar): Odds ratio
< 1 → Higher-priced vehicles are less likely to have an accident.
Each additional $1,000 reduces the odds by ~2%, suggesting that more
expensive cars are generally newer or better maintained.
Horsepower (horsepower): Odds ratio
> 1 → More powerful cars slightly increase accident odds. For
example, a 50 HP increase raises the odds by ~9.5%, possibly due to more
spirited driving.
Fuel type – Electric
(fuel_typeElectric): Odds ratio << 1 →
Electric cars have substantially lower odds (≈ 72 % lower) of reported
accidents compared to the reference fuel type. This effect should be
interpreted cautiously, as electric vehicles in this dataset are
generally newer and may benefit from modern safety
technologies.
These odds ratios allow for an intuitive interpretation of how each variable affects accident probability, holding all other variables constant. Most other predictors (brand, transmission, cylinders, age) were not statistically significant in this model.
The binomial GLM indicates that accident probability is primarily driven by vehicle usage and value. Higher mileage and higher horsepower are associated with increased accident odds, while higher-priced cars are less likely to have a reported accident history. This is consistent with the intuition that cheaper cars tend to be older, more heavily used, and exposed to greater wear and risk. Electric vehicles exhibit substantially lower accident odds in the model. However, this result is likely influenced by sample composition: electric cars in the dataset are generally newer and therefore have had less exposure time to accidents. The observed fuel-type effect may therefore partially reflect vehicle age rather than an intrinsic safety advantage. Overall, the GLM provides a transparent and interpretable baseline model for accident risk. Its linear structure, however, limits its ability to capture nonlinear exposure effects, motivating the use of a Generalised Additive Model in the next section.
A Poisson GLM is typically used when the response variable is
a count (non-negative integers). Because this dataset does not
contain a natural event count, we use a pragmatic proxy: mileage
in thousands (milage_k), converted to an integer
“count-like” variable (milage_k_count).
This section mainly demonstrates the Poisson GLM workflow and
interpretation.
We model the expected mileage count (in thousands) using a Poisson GLM with a log link:
milage_k_count ~ age + accident_bin + brand + fuel_type + transmission + ext_col + int_col
In a Poisson GLM with log link:
- the response is milage_k_count (a non-negative integer proxy);
- the link means log( E[milage_k_count] ) = linear predictor.

So coefficients are interpreted multiplicatively: exp(beta) is the factor change in the expected count for a one-unit increase (or dummy switch), holding other variables fixed.
We fit the Poisson GLM by sourcing
src/06_model_glm_poisson.R, which: 1) loads the feature
data,
2) creates milage_k_count,
3) fits a Poisson GLM with log link, and
4) adds predictions/residuals back to the data.
source(here::here("src", "06_model_glm_poisson.R"))
# compact coefficient table incl. significance stars
poisson_tbl <- broom::tidy(glm_poisson) |>
dplyr::mutate(
signif = dplyr::case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
)
)
poisson_tbl |>
dplyr::select(term, estimate, std.error, statistic, p.value, signif) |>
dplyr::slice(1:12)
## # A tibble: 12 x 6
## term estimate std.error statistic p.value signif
## <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 (Intercept) 3.78 0.0337 112. 0 ***
## 2 age 0.0685 0.000464 147. 0 ***
## 3 accident_bin 0.207 0.00511 40.4 0 ***
## 4 brandAlfa -0.515 0.0652 -7.91 2.67e-15 ***
## 5 brandAudi -0.0884 0.0232 -3.81 1.40e- 4 ***
## 6 brandBMW -0.199 0.0220 -9.07 1.23e-19 ***
## 7 brandBentley -0.965 0.0469 -20.6 5.44e-94 ***
## 8 brandBuick -0.182 0.0342 -5.32 1.07e- 7 ***
## 9 brandCadillac -0.0465 0.0243 -1.91 5.59e- 2 .
## 10 brandChevrolet -0.227 0.0224 -10.1 4.02e-24 ***
## 11 brandChrysler -0.279 0.0337 -8.29 1.13e-16 ***
## 12 brandDodge -0.0689 0.0251 -2.74 6.11e- 3 **
## AIC pseudo_R2_deviance dispersion_pearson
## 1 63503.28 0.4928358 20.50528
If the dispersion is clearly above 1, the Poisson variance assumption (mean ≈ variance) may be violated (overdispersion). In a real use case you’d consider a quasi-Poisson or negative binomial model. We keep Poisson here for comparability and interpretation practice.
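The check itself is short; the sketch below uses simulated overdispersed counts so it is self-contained, and the same Pearson-dispersion computation applies to glm_poisson:

```r
# Negative-binomial draws have variance >> mean, so a Poisson fit to them
# is overdispersed; the Pearson dispersion statistic makes this visible.
set.seed(1)
y <- rnbinom(500, mu = 10, size = 1)  # variance = mu + mu^2/size >> mean
x <- rnorm(500)
fit <- glm(y ~ x, family = poisson)
disp <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
disp  # clearly above 1 -> consider quasipoisson or MASS::glm.nb
```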
In a Poisson GLM with log link:
- exp(beta) > 1 → expected count increases
- exp(beta) < 1 → expected count decreases
- percentage change: (exp(beta) - 1) * 100

## # A tibble: 4 x 5
## term estimate rate_ratio pct_change p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 age 0.0685 1.07 7.1 0
## 2 accident_bin 0.207 1.23 22.9 0
## 3 fuel_typeElectric -1.01 0.364 -63.6 2.88e-156
## 4 transmissionManual -0.311 0.733 -26.7 1.23e-272
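The rate ratios and percentage changes above are exp(beta) applied to the tidy estimates:

```r
# Multiplicative effects from the Poisson coefficients shown above.
round((exp(0.207) - 1) * 100, 1)   # accident_bin: ~ +23% expected mileage
round((exp(-0.311) - 1) * 100, 1)  # transmissionManual: ~ -27%
```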
The Poisson GLM demonstrates how to model a count-like response with
a log link and interpret effects as multiplicative changes via
exp(beta). While milage_k_count is not a true
event count, this workflow is useful to practice GLM estimation,
interpretation, and basic diagnostics. In a real client setting with
genuine count outcomes, overdispersion should be checked carefully and a
quasi-Poisson or negative binomial model may be more appropriate.
To relax the linearity assumption of the binomial GLM, we fit a Generalised Additive Model (GAM) to predict whether a car has had at least one reported accident (accident_bin). The GAM allows for non-linear relationships between key continuous predictors and accident probability, while retaining interpretable parametric effects for categorical variables.
We model accident history using a GAM with a binomial family and logit link:
accident_bin ~ brand + fuel_type + transmission + cylinders + s(milage_k) + s(price_dollar) + s(horsepower) + s(age)
Where:
- accident_bin is a binary response (0 = no reported accident, 1 = at least one reported accident).
- brand, fuel_type, and transmission are categorical predictors.
- cylinders enters the model linearly.
- s(milage_k), s(price_dollar), s(horsepower), and s(age) are smooth spline functions capturing potential nonlinear effects.
- The model is estimated using penalised likelihood with REML.
Smoothness is selected automatically via REML, reducing the risk of overfitting despite flexible spline terms.
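The mechanics can be illustrated on simulated data; the report's actual fit (in src/07_model_gam.R) applies the full formula above to the feature dataset:

```r
library(mgcv)

# Simulate a binary outcome whose log-odds depend nonlinearly on mileage,
# then let gam() choose the smoothness of s(milage_k) via REML.
set.seed(42)
n <- 500
milage_k <- runif(n, 0, 200)
p <- plogis(-1 + 0.03 * milage_k - 1e-4 * milage_k^2)
accident <- rbinom(n, 1, p)
fit <- gam(accident ~ s(milage_k), family = binomial, method = "REML")
summary(fit)$s.table  # an EDF above 1 signals a nonlinear effect
```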
source(here::here("src", "07_model_gam.R"))
# Show a full summary of the model
summary(gam.car)
##
## Family: binomial
## Link function: logit
##
## Formula:
## accident_bin ~ brand + fuel_type + transmission + cylinders +
## s(milage_k) + s(price_dollar) + s(horsepower) + s(age)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.419e-01 5.449e-01 -1.545 0.1223
## brandAlfa -1.006e+00 1.123e+00 -0.895 0.3705
## brandAudi 1.611e-01 3.881e-01 0.415 0.6780
## brandBMW -1.905e-01 3.710e-01 -0.513 0.6077
## brandBentley -3.536e-01 7.696e-01 -0.459 0.6459
## brandBuick 3.009e-01 6.049e-01 0.497 0.6189
## brandCadillac 7.582e-03 4.240e-01 0.018 0.9857
## brandChevrolet -1.726e-01 3.858e-01 -0.447 0.6546
## brandChrysler -8.121e-02 5.774e-01 -0.141 0.8881
## brandDodge 1.182e-01 4.339e-01 0.272 0.7853
## brandFerrari 9.463e-01 1.290e+00 0.733 0.4634
## brandFord -6.184e-02 3.736e-01 -0.166 0.8685
## brandGMC 2.126e-01 4.378e-01 0.486 0.6272
## brandGenesis -1.085e+00 1.114e+00 -0.974 0.3303
## brandHonda -2.836e-01 4.795e-01 -0.591 0.5543
## brandHummer 4.498e-01 6.215e-01 0.724 0.4692
## brandHyundai 9.541e-02 4.502e-01 0.212 0.8322
## brandINFINITI -2.988e-01 4.638e-01 -0.644 0.5195
## brandJaguar -7.691e-02 5.063e-01 -0.152 0.8793
## brandJeep -3.293e-01 4.214e-01 -0.781 0.4346
## brandKia -1.080e-01 5.125e-01 -0.211 0.8331
## brandLamborghini -3.667e+01 1.861e+07 0.000 1.0000
## brandLand -7.975e-01 4.482e-01 -1.779 0.0752 .
## brandLexus 2.860e-01 3.964e-01 0.722 0.4706
## brandLincoln -1.300e-01 4.806e-01 -0.270 0.7868
## brandMINI -1.305e+00 6.480e-01 -2.014 0.0440 *
## brandMaserati -1.126e+00 7.196e-01 -1.565 0.1177
## brandMazda -1.007e-01 5.153e-01 -0.195 0.8451
## brandMercedes-Benz -2.690e-02 3.773e-01 -0.071 0.9432
## brandMitsubishi 4.319e-02 6.045e-01 0.071 0.9430
## brandNissan 1.118e-01 4.120e-01 0.271 0.7861
## brandPontiac -2.782e-01 6.905e-01 -0.403 0.6870
## brandPorsche 1.008e-01 4.148e-01 0.243 0.8080
## brandRAM 3.681e-01 4.656e-01 0.791 0.4292
## brandRivian -3.651e+01 1.733e+07 0.000 1.0000
## brandSubaru 1.435e-01 4.535e-01 0.316 0.7516
## brandTesla -1.148e+00 9.016e-01 -1.273 0.2028
## brandToyota -1.912e-01 3.927e-01 -0.487 0.6263
## brandVolkswagen -4.975e-01 4.860e-01 -1.024 0.3061
## brandVolvo -5.360e-01 5.495e-01 -0.975 0.3294
## fuel_typeE85 Flex Fuel -1.763e-01 3.137e-01 -0.562 0.5741
## fuel_typeElectric -8.444e-01 6.694e-01 -1.261 0.2072
## fuel_typeGasoline -1.789e-01 2.572e-01 -0.695 0.4868
## fuel_typeHybrid -1.172e-01 3.416e-01 -0.343 0.7315
## fuel_typePlug-In Hybrid 6.662e-01 5.035e-01 1.323 0.1857
## transmissionManual -1.408e-01 1.626e-01 -0.866 0.3866
## cylinders 7.551e-03 5.476e-02 0.138 0.8903
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(milage_k) 5.696 6.819 45.201 < 2e-16 ***
## s(price_dollar) 2.266 2.989 14.054 0.00284 **
## s(horsepower) 2.212 2.862 4.814 0.21619
## s(age) 3.077 3.898 8.022 0.07772 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.102 Deviance explained = 11.1%
## -REML = 1623.1 Scale est. = 1 n = 3065
The GAM summary reports estimated degrees of freedom (EDF) and approximate significance tests for each smooth term, indicating whether nonlinear effects are present.
Mileage (s(milage_k))
Mileage shows a highly significant and strongly nonlinear
effect on accident probability (EDF ≈ 5.7, p < 2e−16). The
relatively large EDF indicates a complex relationship: accident risk
does not increase at a constant rate with mileage. Instead, the effect
varies across usage levels, reflecting changing exposure and driving
patterns over a vehicle’s lifetime.
Price (s(price_dollar))
Price has a statistically significant smooth effect
(EDF ≈ 2.3, p = 0.0028). While the effect is nonlinear, its lower EDF
suggests a shape close to linear. Higher-priced vehicles tend to have a
lower probability of reported accidents, consistent with newer vehicles,
better maintenance, and improved safety features.
Horsepower (s(horsepower))
The smooth effect of horsepower is not statistically
significant (p = 0.216). Although horsepower is modelled
flexibly, the data provide no strong evidence that engine power
independently affects accident probability once mileage, price, and
other characteristics are controlled for.
Age (s(age))
The smooth effect of age is marginally insignificant (p
= 0.078). This suggests that age alone does not strongly predict
accident probability after accounting for mileage and vehicle
characteristics. Much of the age-related risk appears to be captured
indirectly through mileage.
Overall, mileage and price emerge as the most important nonlinear predictors of accident history in the GAM.
The parametric terms (brand, fuel type, transmission, and cylinders) are interpreted as in a standard logistic regression, holding all smooth effects constant.
Brand effects
Most brand coefficients are not statistically significant, indicating
limited brand-specific differences in accident probability after
controlling for mileage, price, and other variables. One exception is
MINI, which shows a significantly lower accident
probability compared to the reference brand (p = 0.044).
Fuel type
Fuel type coefficients are not statistically significant. Electric
vehicles show a negative coefficient, suggesting lower accident odds,
but this effect is not significant once nonlinear mileage and price
effects are accounted for.
Transmission
Manual transmission does not have a statistically significant effect on
accident probability in this model.
Cylinders
The number of cylinders enters linearly and is not statistically
significant, suggesting little direct impact on accident risk.
Overall, categorical vehicle characteristics play a secondary role compared to usage-related variables such as mileage and price.
The Generalised Additive Model identifies mileage as the dominant predictor of accident risk, showing a highly significant and nonlinear relationship with the probability of reported accidents. Vehicle price also matters: lower-priced cars are more likely to have had accidents. Other continuous predictors, including horsepower and age, show no strong independent effects once the nonlinear patterns of mileage and price are accounted for. Most categorical variables (brand, fuel type, transmission, and cylinders) are not statistically significant, suggesting that accident risk is driven primarily by usage intensity and vehicle value, rather than brand identity. Compared to the linear binomial GLM, the GAM captures nonlinear effects and provides a more nuanced and interpretable view of accident risk, although overall predictive power remains limited due to the inherent randomness of accidents.
We fit Support Vector Machine (SVM) regression models to predict
log_price. We used two libraries: e1071 and
kernlab (via caret::train). Both models employ
a radial kernel to capture non-linear relationships in the data.
The setup:
- Target: log_price.
- Split: 80/20 into train/test (seeded for reproducibility).
- e1071: svm tuned over a small cost/gamma grid; 3-fold cross-validation inside tune() selects the best combination on the training split.
- caret/kernlab: svmRadial tuned over a small C/sigma grid with 3-fold cross-validation.

e1071: As the heatmap shows, several regions share similar green shades, indicating comparable performance across those parameter combinations. The lowest RMSE was achieved with cost = 4, gamma = 0.05 (CV RMSE = 0.2846).

caret/kernlab: The best kernlab parameters were C = 2, sigma = 0.01 (CV RMSE = 0.4298).
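A minimal version of the e1071 tuning step, sketched on built-in data with an illustrative two-by-two grid (the report's grid, data, and split differ):

```r
library(e1071)

# 3-fold CV over a small cost/gamma grid; tune() refits the radial SVM for
# each combination and keeps the one with the lowest CV error.
set.seed(1)
tuned <- tune(
  svm, log(mpg) ~ wt + hp + disp, data = mtcars,
  kernel = "radial",
  ranges = list(cost = c(1, 4), gamma = c(0.05, 0.5)),
  tunecontrol = tune.control(cross = 3)
)
tuned$best.parameters  # best (cost, gamma) pair on this toy problem
```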
The radial Support Vector Machine from the e1071 package performs best, showing the smallest average prediction error (RMSE ≈ 0.29) and explaining about 87% of the variation in used-car prices, outperforming the alternative SVM model. SVMs do not yield straightforward coefficient interpretations; they learn support vectors and decision functions in a transformed feature space.
| .estimator | model | rmse | mae | rsq |
|---|---|---|---|---|
| standard | e1071_radial | 0.290 | 0.209 | 0.868 |
| standard | kernlab_radial | 0.354 | 0.251 | 0.803 |
Both models show strong agreement between predicted and true prices,
with points clustering closely around the diagonal and the smooth curve
largely overlapping the ideal line, indicating good calibration and
limited systematic bias. Slight over-prediction is visible for very
low-priced cars and mild under-prediction for the most expensive
vehicles, a common pattern in pricing problems. The e1071 radial SVM
exhibits marginally tighter clustering and smoother alignment with the
diagonal, consistent with its lower RMSE and higher R².
We fit two neural networks to predict log_price: a caret
nnet with preprocessing/tuning and a manual
neuralnet with a shallow hidden layer.
- Target: log_price.
- nnet: recipe with dummy variables + zero-variance filter + centering/scaling; 3-fold cross-validation over a small size/decay grid; maxit = 300.
- neuralnet: one hidden layer (4 units) on a scaled/dummified matrix, with optional subsampling for speed; fixed architecture (no CV tuning).
- Fitted models are stored in report/models/nn.

Best caret grid point: size = 5, decay = 0.001 (CV RMSE ≈ 0.372). The white/black marker on the heatmap highlights this combination.
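The caret-based fit can be sketched as below. This is an illustration under assumed names (train_data, the recipe object, the grid beyond the winning size = 5, decay = 0.001); the manual neuralnet fit is omitted because its exact formula and matrix preparation are project-specific.

```r
# Illustrative sketch of the caret nnet fit (train_data and grid values assumed)
library(caret)
library(recipes)

rec <- recipe(log_price ~ ., data = train_data) |>
  step_dummy(all_nominal_predictors()) |>   # dummy variables
  step_zv(all_predictors()) |>              # zero-variance filter
  step_center(all_numeric_predictors()) |>
  step_scale(all_numeric_predictors())

set.seed(123)
nn_caret <- train(
  rec,
  data = train_data,
  method = "nnet",
  linout = TRUE,        # linear output unit: regression, not classification
  maxit = 300,
  trace = FALSE,
  tuneGrid = expand.grid(size = c(3, 5, 7), decay = c(0.001, 0.01, 0.1)),
  trControl = trainControl(method = "cv", number = 3)
)

nn_caret$bestTune   # winning size/decay combination
```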
The caret-based neural network performs better than the manually specified variant, achieving a smaller average prediction error (RMSE ≈ 0.36) and explaining about 80% of the variation in used-car prices (R² ≈ 0.80). The manually specified neural network shows weaker predictive performance, with larger errors (RMSE ≈ 0.44) and a lower proportion of explained variance (R² ≈ 0.70). Neural networks do not provide easily interpretable coefficients; instead, they learn complex non-linear relationships between vehicle characteristics and prices through layered transformations.
| .estimator | model | rmse | mae | rsq |
|---|---|---|---|---|
| standard | caret_nnet | 0.360 | 0.262 | 0.799 |
| standard | neuralnet_manual | 0.443 | 0.301 | 0.695 |
Both neural network models capture the strong relationship between observed and predicted log-transformed prices, indicating that the main structure of the data is well learned. Predictions are generally accurate across the central price range, where most observations lie, with limited overall bias.
The caret-based neural network shows more stable and consistent performance across different price levels. In contrast, the manually specified neural network exhibits greater variability in its predictions, particularly for higher-priced vehicles, where underestimation is more pronounced.
Overall, while both neural networks perform well, the caret
implementation delivers more reliable predictions across the full price
range and is therefore preferred in this analysis.
Both models use the same 80/20 train-test split (stored in
data/processed/train_test_split.rds). The table below
compares the linear regression on log_price to the
best-performing SVM.
| model | rmse | mae | r2 |
|---|---|---|---|
| Linear regression | 0.411 | 0.306 | 0.733 |
| SVM e1071_radial | 0.290 | 0.209 | 0.868 |
On the shared test set, the radial SVM clearly outperforms the linear regression model. It achieves substantially lower prediction errors (RMSE 0.290 vs 0.411, MAE 0.209 vs 0.306) and explains a larger share of the variation in log-prices (R² 0.868 vs 0.733).
While the linear model provides useful interpretability and captures the main price trends, it is limited by its linear structure. The SVM’s ability to model non-linear relationships results in more accurate predictions across the full price range. Overall, this comparison highlights the trade-off between interpretability (linear regression) and predictive accuracy (SVM), with the SVM being the preferred model for prediction.
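A comparison table like the one above can be produced on the shared test set roughly as follows. The object names (test_data, lm_fit, best_svm) are assumptions; the metric functions are from the yardstick package.

```r
# Illustrative sketch of the shared test-set evaluation (object names assumed)
library(yardstick)
library(dplyr)

preds <- tibble(
  truth    = test_data$log_price,
  lm_pred  = predict(lm_fit, test_data),
  svm_pred = as.numeric(predict(best_svm, test_data))
)

metrics_for <- function(est) {
  preds |>
    summarise(
      rmse = rmse_vec(truth, {{ est }}),
      mae  = mae_vec(truth, {{ est }}),
      r2   = rsq_vec(truth, {{ est }})
    )
}

bind_rows(
  `Linear regression` = metrics_for(lm_pred),
  `SVM e1071_radial`  = metrics_for(svm_pred),
  .id = "model"
)
```

Computing all metrics from one prediction table guarantees both models are scored on identical test rows, which is the point of the shared split stored in data/processed/train_test_split.rds.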
(Binomial GLM vs. Binomial GAM)
Because accident_bin is a binary outcome, we compare
only models suitable for classification: the binomial GLM
(logistic regression) and the binomial GAM.
Both models were trained on the same feature set and evaluated on the
same train/test split.
| Model | Accuracy | AUC |
|---|---|---|
| Binomial GLM | 0.724 | 0.704 |
| Binomial GAM | 0.728 | 0.707 |
The binomial GAM slightly outperforms the GLM (Accuracy 0.728 vs. 0.724; AUC 0.707 vs. 0.704) by capturing nonlinear effects in the data. The GLM provides a simple, interpretable baseline, while the GAM offers a more flexible model that better fits the patterns in accident history.
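The two classifiers can be sketched as below. The predictor set shown (model_year, milage, brand) is an assumption for illustration, as is the 0.5 probability threshold; accident_bin is assumed to be coded 0/1.

```r
# Illustrative sketch of the binomial GLM vs. GAM comparison (predictors assumed)
library(mgcv)
library(pROC)

glm_fit <- glm(accident_bin ~ model_year + milage + brand,
               data = train_data, family = binomial())

gam_fit <- gam(accident_bin ~ s(model_year) + s(milage) + brand,
               data = train_data, family = binomial())

glm_prob <- predict(glm_fit, test_data, type = "response")
gam_prob <- predict(gam_fit, test_data, type = "response")

# Accuracy at a 0.5 threshold, and AUC from the predicted probabilities
mean((gam_prob > 0.5) == test_data$accident_bin)
auc(roc(test_data$accident_bin, as.numeric(gam_prob)))
```

The only structural difference between the two models is the s() smooths in the GAM formula, which is what lets it capture the nonlinear effects credited for its slight edge.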
This project provided hands-on experience with the full applied machine learning workflow, from raw data preparation to model comparison and interpretation. We learned that data cleaning and feature engineering are critical steps that strongly influence model performance, often more than the choice of model itself. In particular, transforming the price variable and carefully handling categorical predictors were essential for obtaining stable and meaningful results.
Through exploratory analysis, we gained insight into the structure of used-car prices and identified clear non-linear relationships with variables such as mileage, age, and brand. These insights helped guide model selection and highlighted the limitations of purely linear approaches for this type of data.
By fitting and comparing multiple models on a shared data split, we learned the importance of fair and consistent evaluation. Using identical predictors, response variables, and train–test splits ensured that performance differences reflected genuine modelling capabilities rather than artefacts of the evaluation setup.
The comparison of linear regression, neural networks, and support vector machines illustrated the trade-off between interpretability and predictive accuracy. Linear models are valuable for understanding price drivers, while more flexible machine learning models—especially the radial Support Vector Machine—deliver superior predictive performance by capturing complex, non-linear patterns.
Overall, this project reinforced that effective applied machine learning requires not only technical modelling skills, but also thoughtful data preparation, careful evaluation design, and clear communication of results aligned with the analysis objective.